suppressPackageStartupMessages(library(tidyverse))

Case study: how do features of nesting female horseshoe crabs influence the number of males found nearby?

Load the data. The first six of its 173 rows are shown below:

crab <- read_table("https://newonlinecourses.science.psu.edu/stat504/sites/onlinecourses.science.psu.edu.stat504/files/lesson07/crab/index.txt", col_names = FALSE) %>% 
  select(-1) %>% 
  setNames(c("colour","spine","width","weight","n_male")) %>% 
  mutate(colour = factor(colour),
         spine  = factor(spine))
## Parsed with column specification:
## cols(
##   X1 = col_integer(),
##   X2 = col_integer(),
##   X3 = col_integer(),
##   X4 = col_double(),
##   X5 = col_double(),
##   X6 = col_integer()
## )
knitr::kable(head(crab))
colour  spine  width  weight  n_male
2       3      28.3   3.05    8
3       3      26.0   2.60    4
3       3      25.6   2.15    0
4       2      21.0   1.85    0
2       3      29.0   3.00    1
1       2      25.0   2.30    3

Predictors: colour, spine condition, carapace width, and weight. Response: the number of males nearby (n_male).

First, let’s see how carapace width influences the mean number of males nearby.

p <- ggplot(crab, aes(width, n_male)) + 
  geom_point(alpha=0.25) +
  labs(x = "Carapace Width", 
       y = "No. males\nnearby") +
  theme_bw() +
  theme(axis.title.y = element_text(angle=0, vjust=0.5))
plotly::ggplotly(p)

Data source: H. Jane Brockmann’s 1996 paper; found online here; another regression demo with this data is found here.

Approach 1: Estimate regression curve / model function locally

Preliminary questions

These questions are meant to check your understanding of local regression.

What is the estimated mean number of nearby males for nesting females having a carapace width of 32.5? Use the following methods, by hand.

1. kNN with \(k=3\).

2. Using a moving window with a radius of 2.4.

3. Using a kernel smoother with a Gaussian kernel with variance 1.

4. Using local polynomials with a radius of 2.4 and a flat kernel, first with degree 1, then with degree 2.
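The hand calculations above can be checked with a few small helper functions. This is only a sketch: the function names (knn_mean, window_mean, kernel_mean, local_poly) are made up for illustration, the x and y vectors are toy values rather than the crab data, and ties in distance for kNN are broken by row order, which is an assumption and not a rule from the worksheet.

```r
# Toy vectors for illustration only (NOT the crab data).
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2, 4, 3, 7, 8, 6)

# kNN: average the responses of the k observations closest to x0.
# Ties in distance are broken by row order (an assumption).
knn_mean <- function(x, y, x0, k) {
  nearest <- order(abs(x - x0))[seq_len(k)]
  mean(y[nearest])
}

# Moving window: average the responses whose x lies within radius r of x0.
window_mean <- function(x, y, x0, r) {
  mean(y[abs(x - x0) <= r])
}

# Kernel smoother: weighted average with Gaussian weights.
# A variance of 1 corresponds to sd = 1.
kernel_mean <- function(x, y, x0, sd = 1) {
  w <- dnorm(x - x0, sd = sd)
  sum(w * y) / sum(w)
}

# Local polynomial with a flat kernel: ordinary least squares on the
# observations inside the window, then predict at x0.
local_poly <- function(x, y, x0, r, degree) {
  near <- data.frame(x = x, y = y)[abs(x - x0) <= r, ]
  fit <- lm(y ~ poly(x, degree, raw = TRUE), data = near)
  unname(predict(fit, newdata = data.frame(x = x0)))
}

knn_mean(x, y, x0 = 3.1, k = 3)
window_mean(x, y, x0 = 3.1, r = 1)
kernel_mean(x, y, x0 = 3.1)
local_poly(x, y, x0 = 3, r = 1.5, degree = 1)
local_poly(x, y, x0 = 3, r = 1.5, degree = 2)
```

For the questions above, the same functions would be called with crab$width as x, crab$n_male as y, and x0 = 32.5, for example knn_mean(crab$width, crab$n_male, 32.5, k = 3).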

Fit a smoother by eye

Tune the loess fit by eye. To keep things simple, adjust only the span.

grid <- seq(min(crab$width), max(crab$width), length.out=100)
grid_df <- tibble(width = grid)
# FIT_MODEL_HERE
# PLOT_CURVE_HERE
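One possible shape for the two placeholder steps is sketched below. So that the block runs on its own, it uses R's built-in cars data as a stand-in; the names dat, fit, and y_hat are illustrative, and the span of 0.75 is just the loess default used as a starting point, not a recommended answer.

```r
library(ggplot2)

# Stand-in data so the sketch is self-contained: the built-in `cars` dataset.
# For the exercise, x would be crab$width and y would be crab$n_male.
dat <- data.frame(x = cars$speed, y = cars$dist)

# Fit the smoother; `span` is the only tuning parameter being adjusted.
span <- 0.75  # loess default, used here only as a starting value
fit <- loess(y ~ x, data = dat, span = span)

# Evaluate the fitted curve on an evenly spaced grid over the observed range.
grid_df <- data.frame(x = seq(min(dat$x), max(dat$x), length.out = 100))
grid_df$y_hat <- predict(fit, newdata = grid_df)

# Overlay the fitted curve on the scatterplot.
p_fit <- ggplot(dat, aes(x, y)) +
  geom_point(alpha = 0.25) +
  geom_line(data = grid_df, aes(x, y_hat), colour = "blue") +
  theme_bw()
p_fit
```

Re-running the block with smaller or larger values of span shows the trade-off between a wiggly curve and an overly smooth one.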

What is the error of this model? Reporting the training error is fine.
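A minimal sketch of one way to compute training error for a loess fit, assuming mean squared error is the error measure (the worksheet does not name one). The built-in cars data again stands in for the crab data, and the variable names are illustrative.

```r
# Stand-in data: the built-in `cars` dataset (swap in crab$width and crab$n_male).
dat <- data.frame(x = cars$speed, y = cars$dist)
fit <- loess(y ~ x, data = dat, span = 0.75)

# Training error: compare each observed response with the fitted value
# at the same x, using the same data the model was fit on.
train_mse  <- mean((dat$y - predict(fit))^2)
train_rmse <- sqrt(train_mse)  # same units as the response

train_mse
train_rmse
```

Because the error is measured on the data used for fitting, it tends to shrink as span decreases, so it should not be used on its own to choose the span.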

How well does this model answer our original question?

Approach 2: Linear Regression

Fit a linear regression model

Fit a linear regression model. What’s the error?
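The fit and its training error could be computed along the following lines. This sketch assumes mean squared error as the error measure and uses the built-in cars data as a stand-in so that it runs on its own; fit_lm and train_mse are illustrative names.

```r
# Stand-in data: the built-in `cars` dataset (swap in crab$width and crab$n_male).
dat <- data.frame(x = cars$speed, y = cars$dist)

# Ordinary least squares: a straight line for the mean response.
fit_lm <- lm(y ~ x, data = dat)
coef(fit_lm)  # intercept and slope

# Training mean squared error of the fitted line.
train_mse <- mean(residuals(fit_lm)^2)
train_mse
```

Computing the error the same way as for the loess fit makes the two approaches directly comparable.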

How well does this model answer our original question? Do you see a potential problem with this model? Are any assumptions of linear regression violated? Brainstorm ideas for how to deal with the problems.